An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild
Zero-shot learning (ZSL) methods have been studied in the unrealistic setting
where test data are assumed to come from unseen classes only. In this paper, we
advocate studying the problem of generalized zero-shot learning (GZSL) where
the test data's class memberships are unconstrained. We show empirically that
naively using the classifiers constructed by ZSL approaches does not perform
well in the generalized setting. Motivated by this, we propose a simple but
effective calibration method that can be used to balance two conflicting
forces: recognizing data from seen classes versus those from unseen ones. We
develop a performance metric to characterize such a trade-off and examine the
utility of this metric in evaluating various ZSL approaches. Our analysis
further shows that there is a large gap between the performance of existing
approaches and an upper bound established via idealized semantic embeddings,
suggesting that improving class semantic embeddings is vital to GZSL.
Comment: ECCV 2016 camera-ready.
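The calibration described above can be sketched as follows. This is a minimal illustration, assuming the calibration takes the common form of subtracting a constant factor from the scores of seen classes before taking the argmax; the factor name `gamma` and the function name are hypothetical, not from the paper:

```python
import numpy as np

def calibrated_prediction(scores, seen_mask, gamma):
    """Balance seen vs. unseen classes in generalized zero-shot learning.

    scores:    (n_samples, n_classes) classifier scores over ALL classes
    seen_mask: (n_classes,) array, 1.0 for seen classes, 0.0 for unseen
    gamma:     calibration factor penalizing seen-class scores
    """
    # Penalize seen classes so unseen classes have a fair chance of
    # winning the argmax in the generalized setting.
    adjusted = scores - gamma * seen_mask
    return np.argmax(adjusted, axis=-1)
```

Sweeping `gamma` from zero upward traces out the trade-off between recognizing seen and unseen classes that the paper's performance metric characterizes.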
Weakly Supervised Content Selection for Improved Image Captioning
Image captioning involves identifying semantic concepts in the scene and
describing them in fluent natural language. Recent approaches do not explicitly
model the semantic concepts and train the model only for the end goal of
caption generation. Such models lack interpretability and controllability,
primarily due to sub-optimal content selection. We address this problem by
breaking down the captioning task into two simpler, more manageable, and more
controllable tasks -- skeleton prediction and skeleton-based caption
generation. We approach the former as a weakly supervised task, using a simple
off-the-shelf language syntax parser and avoiding the need for additional human
annotations; the latter uses a supervised-learning approach. We investigate
three methods of conditioning the caption on skeleton in the encoder, decoder
and both. Our compositional model generates captions of significantly better
quality on out-of-domain test images, as judged by human annotators.
Additionally, we demonstrate that English skeletons transfer effectively to
other languages, including French, Italian, German, Spanish, and Hindi. This
compositional structure also points toward unpaired image captioning, reducing
the dependence on expensive image-caption pairs. Furthermore, we investigate
the use of skeletons as a knob to control certain properties of the generated
caption, such as length, content, and gender expression.
What You See is What You Read? Improving Text-Image Alignment Evaluation
Automatically determining whether a text and a corresponding image are
semantically aligned is a significant challenge for vision-language models,
with applications in generative text-to-image and image-to-text tasks. In this
work, we study methods for automatic text-image alignment evaluation. We first
introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets
from both text-to-image and image-to-text generation tasks, with human
judgements for whether a given text-image pair is semantically aligned. We then
describe two automatic methods to determine alignment: the first involving a
pipeline based on question generation and visual question answering models, and
the second employing an end-to-end classification approach by finetuning
multimodal pretrained models. Both methods surpass prior approaches in various
text-image alignment tasks, with significant improvements in challenging cases
that involve complex composition or unnatural images. Finally, we demonstrate
how our approaches can localize specific misalignments between an image and a
given text, and how they can be used to automatically re-rank candidates in
text-to-image generation.
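The question-generation/VQA pipeline described above can be sketched at a high level. The callables `gen_questions` and `vqa_answer` below are hypothetical stand-ins for the actual models, and averaging per-question agreement is one plausible aggregation, not necessarily the paper's:

```python
def alignment_score(text, image, gen_questions, vqa_answer):
    """Score text-image alignment via a QG + VQA pipeline (sketch).

    gen_questions(text) -> list of (question, expected_answer) pairs
    vqa_answer(image, question) -> the VQA model's answer string
    """
    qa_pairs = gen_questions(text)
    if not qa_pairs:
        return 0.0
    # Fraction of generated questions whose VQA answer matches the
    # answer implied by the text.
    hits = sum(vqa_answer(image, q) == a for q, a in qa_pairs)
    return hits / len(qa_pairs)
```

A side effect of this design is that each disagreeing question pinpoints which part of the text conflicts with the image, which matches the misalignment-localization use described above.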
MaXM: Towards Multilingual Visual Question Answering
Visual Question Answering (VQA) has been primarily studied through the lens
of the English language. Yet, tackling VQA in other languages in the same
manner would require a considerable amount of resources. In this paper, we
propose scalable solutions to multilingual visual question answering (mVQA), on
both data and modeling fronts. We first propose a translation-based framework
for mVQA data generation that requires far less human annotation effort than
the conventional approach of directly collecting questions and answers. Then,
we apply our framework to the multilingual captions in the Crossmodal-3600
dataset and develop an efficient annotation protocol to create MaXM, a
test-only VQA benchmark in 7 diverse languages. Finally, we develop a simple,
lightweight, and effective approach as well as benchmark state-of-the-art
English and multilingual VQA models. We hope that our benchmark encourages
further research on mVQA.
Comment: EMNLP 2023 (Findings).
https://github.com/google-research-datasets/max